On Integrating Error Detection into a Fault Diagnosis Algorithm for Massively Parallel Computers

Authors

Jörn Altmann

Tamás Bartha

András Pataricza

Abstract

Scalable fault diagnosis is necessary for constructing fault tolerance mechanisms in large massively parallel multiprocessor systems. The diagnosis algorithm must operate efficiently even if the system consists of several thousand processors. In this paper we introduce an event-driven, distributed system-level diagnosis algorithm. It uses a small number of messages and is based on a general diagnosis model that places no limit on the number of simultaneously existing faults (an important requirement for massively parallel computers). The algorithm integrates message-based error detection techniques with built-in hardware mechanisms. The structure of the implemented algorithm is presented, and the essential program modules are described. The paper also discusses how test results generated by error detection mechanisms are used for fault localization. Measurement results illustrate the effect of the diagnosis algorithm, in particular of message-based error detection, on application performance.

Related resources

An approach to fault detection and correction in design of systems using of Turbo codes

We present an approach to the design of fault-tolerant computing systems. In this paper, a technique is employed that enables the combination of several codes, in order to obtain flexibility in the design of error-correcting codes. Code-combining techniques are very effective; turbo codes are one example. The Algorithm-based fault tolerance techniques that to detect errors rely on the c...

An Event-driven Approach to Multiprocessor Diagnosis

For constructing fault tolerance mechanisms in large massively parallel multiprocessor systems, a scalable fault diagnosis is necessary, which works efficiently even if there are several thousand processors in the system. In this paper we present an event-driven, distributed system-level diagnosis algorithm, based on a general diagnosis model which does not limit the number of simultaneously ex...

A Software Implemented Fault-tolerance Layer for Reliable Computing on Massively Parallel Computers and Distributed Computing Systems

A novel architecture for a software-implemented fault-tolerance layer for application reliability on massively parallel computers and distributed computing systems is proposed. This is the first attempt at providing a purely software-based, user-level solution for fault detection, reconfiguration, and recovery in a parallel environment. The symmetrically distributed, multi-tiered layer envelopes u...

An Approach for Hierarchical System Level Diagnosis of Massively Parallel Computers Combined with a Simulation-Based Method for Dependability Analysis

The primary focus in the analysis of massively parallel supercomputers has traditionally been on their performance. However, their complex network topologies, large number of processors, and sophisticated system software can make them very unreliable. If every failure of one of the many components of a massively parallel computer could shut down the machine, the machine would be useless. Theref...

Reversible Logic Multipliers: Novel Low-cost Parity-Preserving Designs

Reversible logic is one of the new paradigms for power optimization that can be used instead of the current circuits. Moreover, the fault-tolerance capability in the form of error detection or error correction is a vital aspect for current processing systems. In this paper, as the multiplication is an important operation in computing systems, some novel reversible multiplier designs are propose...

Publication date: 2001
